\documentclass[10pt]{article}
\usepackage[margin=1in]{geometry}
%\usepackage[utf8]{inputenc}
\usepackage{amsmath}
\usepackage{framed, color}
\usepackage{url}
\definecolor{shadecolor}{rgb}{1,1,1}
\usepackage{graphicx}
%\usepackage{subcaption}
\usepackage{listings}
\usepackage{color}
\usepackage{natbib}
\lstset{ %
language=R,                % choose the language of the code
basicstyle=\footnotesize,       % the size of the fonts that are used for the code
numbers=left,                   % where to put the line-numbers
numberstyle=\footnotesize,      % the size of the fonts that are used for the line-numbers
stepnumber=1,                   % the step between two line-numbers. If it is 1 each line will be numbered
numbersep=7pt,                  % how far the line-numbers are from the code
backgroundcolor=\color{white},  % choose the background color. You must add \usepackage{color}
showspaces=false,               % show spaces adding particular underscores
showstringspaces=false,         % underline spaces within strings
showtabs=false,                 % show tabs within strings adding particular underscores
frame=single,           % adds a frame around the code
tabsize=2,          % sets default tabsize to 2 spaces
captionpos=b,           % sets the caption-position to bottom
breaklines=true,        % sets automatic line breaking
breakatwhitespace=false,    % sets if automatic breaks should only happen at whitespace
escapeinside={(*@}{@*)},          % if you want to add a comment within your code
xleftmargin=.5in,
xrightmargin=.25in
}

\title{Week 7: Bayesian Inference and Parameter Estimation}
\author{Phong Le, Willem Zuidema}

\begin{document}
\lstset{language=R}
\renewcommand{\lstlistingname}{Code}

\maketitle


We are curious about some events or objects (such as a language)
and want to study their \textit{hidden} mechanisms (a grammar) $G_{gold}$.
A proper way to do this is to collect a lot of data (sentences, dialogues)
$D = \{x_1,x_2,...,x_n\}$
and then find a model $\hat{G}$ that best \textit{fits} (or explains)
$D$. In this way, we expect $\hat{G}$ to be a `good' estimate of
$G_{gold}$.

In this lab, we will first study one quality metric for measuring
the `degree of belief' that a model $G$ is a good estimate of $G_{gold}$
given observed data $D$: the posterior probability $P(G|D)$, and
how to compute it using Bayesian inference. Then, we will examine
two widely used estimation methods: Maximum Likelihood Estimation (MLE)
and Maximum A Posteriori (MAP) estimation.



\paragraph{Required R Code} At \url{http://www.illc.uva.nl/LaCo/clas/fncm13/assignments/computerlab-week7/} 
you can find the R-files you need for this exercise.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Bayesian Inference}

In statistics, according to Wikipedia, Bayesian inference 
\begin{quote}
is a method of inference in which Bayes' rule is used to update the 
probability estimate for a hypothesis as additional evidence is acquired.
\end{quote}
In other words, Bayesian inference computes the posterior probability
$P(G|D)$ using Bayes' rule
\begin{equation}
	P(G|D) = \frac{P(D|G)P(G)}{P(D)}
\end{equation}
where $P(G)$ is the prior probability of $G$ and $D$ is additional evidence.
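To see the rule in action before the toy example: when there are only finitely
many candidate models, the update is a one-line computation. Below is a minimal
R sketch with made-up numbers (it is not tied to any exercise file):
\begin{lstlisting}
prior      = c(0.5, 0.5)   # P(G) for two candidate models
likelihood = c(0.8, 0.2)   # P(D|G) for each model (made-up numbers)
# Bayes' rule: P(G|D) = P(D|G)P(G) / P(D), where P(D) = sum over G of P(D|G)P(G)
posterior  = likelihood * prior / sum(likelihood * prior)
print(posterior)           # 0.8 0.2
\end{lstlisting}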
In order to illustrate the method, let's examine the toy example below. 


\subsection*{Toy Example: Murder in Dam Square}
A man was found dead in Dam Square, and two people, namely $A$ and $B$, are
suspected. After 24 hours of investigation, the police found four witnesses; one of them
reported that he saw $A$ shoot the victim, whereas the others said it was $B$.
However, because it was foggy at the time, the police estimate that each
witness distinguished the two suspects correctly only $80\%$ of the time.
Our task is to use Bayesian inference
to help the police find out which one is the murderer, $A$ or $B$.

First of all, we need to model the problem mathematically. Let's denote
\begin{itemize}
	\item $P(X)$ the prior probability that $X$ is the murderer (note: $P(X=B) = 1 - P(X=A)$)
	\item $P(W_i|X)$ ($i=1,\dots,4$) the reliability of the $i$-th witness's report.
	Here, $P(W_i = X | X) = 0.8$.
	\item $P(X|W_{1,2,3,4})$ the posterior probability that $X$ is the murderer
	based on the evidence given by all the four witnesses. 
\end{itemize}
Our goal is to compute the posterior probability $P(X=A|W_1=A,W_2=B,W_3=B,W_4=B)$
by updating the posterior step by step as each piece of evidence
arrives, as follows
\begin{itemize}
	\item Step 0: when we don't have any evidence, we can only judge based on 
	the prior probability $P(X)$.

	\item Step 1: after the first witness reports, we update the posterior probability
	\[
		P(X| W_1=A) = \frac{P(W_1=A|X)P(X)}{P(W_1=A)}
	\]
	where $P(W_1=A) = \sum_{X\in\{A,B\}} P(W_1=A|X)P(X)$.
	\item Step 2: after the second witness reports, we update the posterior probability
	\[
		P(X| W_1=A, W_2=B) = \frac{P(W_2=B|X)P(X|W_1=A)}{P(W_2=B | W_1=A)}
	\]
	where $P(X| W_1=A)$ is computed in step 1. (Note: because $W_i,W_j$ with $i\neq j$
	are independent given $X$, $P(W_2=B|X,W_1) = P(W_2=B|X)$.)

	\item Step 3: after the third witness reports, we update the posterior probability
	\[
		P(X| W_1=A, W_2=B, W_3=B) = \frac{P(W_3=B|X)P(X|W_1=A,W_2=B)}{P(W_3=B | W_1=A,W_2=B)}
	\]
	where $P(X| W_1=A, W_2=B)$ is computed in step 2.
	
	\item Step 4: after the last witness reports, we update the posterior probability
	\[
		P(X| W_1=A, W_2=B, W_3=B, W_4=B) = \frac{P(W_4=B|X)P(X|W_1=A,W_2=B,W_3=B)}{P(W_4=B | W_1=A,W_2=B,W_3=B)}
	\]
	where $P(X| W_1=A, W_2=B,W_3=B)$ is computed in step 3.
	
\end{itemize}
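To make step 1 concrete: with the uniform prior $P(X=A)=P(X=B)=0.5$ and
witness reliability $0.8$,
\[
	P(X=A|W_1=A) = \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.2 \times 0.5} = 0.8
\]
and each later step repeats the same computation, with the current posterior
playing the role of the prior.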

\begin{framed}
Exercise 1: We set up the experiment as follows
\begin{lstlisting}
p.prior = c(0.5,0.5)	# P(X=A) = P(X=B) = 0.5
p.witness = 0.8			# P(W_i = X | X) = 0.8
witness = c(1,2,2,2)	# W1 = A, W2 = W3 = W4 = B
\end{lstlisting}
Then
\begin{itemize}
	\item Step 0: 
	\begin{lstlisting}
	# step 0
	p.poste = p.prior
	print(p.poste)
	\end{lstlisting}
	
	\item Step 1, 2, 3, 4:

	\begin{lstlisting}	
	# step i > 0
	for (i in 1:length(witness)) {
		if (witness[i] == 1) # if the witness saw A
			p.poste = p.poste * c(p.witness, 1-p.witness)
		else # if the witness saw B
			p.poste = p.poste * c(1-p.witness, p.witness)
		p.poste = p.poste / sum(p.poste)	# normalize

		print(p.poste)
	}
	\end{lstlisting}
\end{itemize}

Is the posterior probability at step 2 the same as at step 0?
Explain why.

Based on the posterior probability after step 4,
who is the prime suspect?

\end{framed}

\begin{framed}
Exercise 2: In exercise 1, the prior distribution is uniform
because we did not have any evidence yet.
Now assume that $B$ is a beautiful girl who seems innocent,
whereas $A$ looks scary with tattoos and piercings. It is then quite
reasonable to suspect $A$ more than $B$. Therefore,
we adjust the prior distribution as follows
\begin{lstlisting}
p.prior = c(0.9,0.1)	# P(X=A) = 0.9, P(X=B) = 0.1
\end{lstlisting}
while keeping other parameters unchanged. Compute the posterior 
distribution as in exercise 1 and report what you get.

\end{framed}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Parameter Estimation}
\label{section parameter estimation}
In the previous section, we studied how to use Bayesian inference to
estimate a distribution. In this section, we will study how to select
the `best' model given observed data.

\paragraph{Maximum Likelihood Estimation (MLE)} is a method to find values
for a model's parameters such that the likelihood of the observed data,
i.e. the probability of the observed data given the model, is maximized
\begin{equation}
    \label{eqn mle}
    \hat{G}_{MLE} = \arg\max_{G} P(D|G)
\end{equation}

\paragraph{Maximum A Posteriori (MAP) Estimation}, on the other hand,
maximizes the posterior probability
\begin{equation}
    \hat{G}_{MAP} = \arg\max_{G} P(G|D)
\end{equation}
According to Bayes' theorem, we can compute the posterior probability
from the prior probability and the likelihood, i.e.
$P(G|D) = \frac{P(D|G)P(G)}{P(D)}$. Therefore
\begin{equation}
    \label{eqn map}
    \hat{G}_{MAP} = \arg\max_{G} \frac{P(D|G)P(G)}{P(D)} = \arg\max_{G} P(D|G)P(G)
\end{equation}
(because $P(D)$ does not depend on $G$, we can freely drop it).

In order to easily compute $P(D|G)$ in Equations~\ref{eqn mle} and~\ref{eqn map},
observed data are assumed to be \textit{independent and identically distributed}
(i.i.d.), i.e. examples are independently drawn from the same distribution.
Hence
\begin{equation}
    P(D=\{x_1,x_2,...,x_n\}|G) = \prod_{i=1}^n P(x_i|G)
\end{equation}
Now, Equations~\ref{eqn mle} and~\ref{eqn map} respectively become\footnote{Note
that because $\log$ is a monotonically increasing function,
$\arg\max_{G} f(G) = \arg\max_{G} \log f(G)$.}
\begin{equation}
    \label{eqn mle iid}
    \hat{G}_{MLE} = \arg\max_{G} \prod_{i=1}^n P(x_i|G)
    = \arg\max_{G} \sum_{i=1}^n \log P(x_i|G)
\end{equation}
where the right hand side, $\sum_{i=1}^n \log P(x_i|G)$, 
is called \textit{log-likelihood}, and
\begin{equation}
    \label{eqn map iid}
    \hat{G}_{MAP} = \arg\max_{G} P(G) \prod_{i=1}^n P(x_i|G)
    = \arg\max_{G} \big( \log P(G) + \sum_{i=1}^n \log P(x_i|G) \big)
\end{equation}
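Besides the footnote above, there is a practical reason to work with
log-likelihoods: the product of many probabilities (or densities) quickly
underflows in floating-point arithmetic, while the sum of their logarithms
stays finite. A minimal R sketch (the numbers are arbitrary):
\begin{lstlisting}
x = rnorm(1000)              # 1000 i.i.d. standard-normal draws
prod(dnorm(x))               # product of densities underflows to 0
sum(dnorm(x, log = TRUE))    # log-likelihood remains finite
\end{lstlisting}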

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\subsection*{Toy Example}

In the following exercises, we will examine a very simple case: estimating
the mean of a normal distribution $N(x;\mu, \sigma^2)$. The scenario is that
we draw a sample $D=\{x_1,...,x_n\}$ from $N(x;\mu_{gold}, \sigma_{gold}^2)$; then we ask you to
estimate $\mu_{gold}$. (Note that, in order to apply the above equations, we need to
replace probabilities with densities.)

Recall that, if $x \sim N(\mu,\sigma^2)$ then
\begin{equation}
    p(x|\mu) = \frac{1}{\sigma\sqrt{2\pi}} \exp \big( -\frac{(x-\mu)^2}{2\sigma^2} \big)
\end{equation}
hence 
\begin{equation}
    \log p(x|\mu) = -\frac{(x-\mu)^2}{2\sigma^2} + U
\end{equation}
where $U$ is a constant independent of $\mu$. Now, Equations~\ref{eqn mle iid}
and \ref{eqn map iid} respectively become
\begin{equation}
    \hat{\mu}_{MLE} = \arg\max_{\mu} - \sum_{i=1}^n (x_i - \mu)^2
    = \arg\min_{\mu} \sum_{i=1}^n (x_i - \mu)^2
\end{equation}
\begin{equation}
    \hat{\mu}_{MAP} = \arg\max_{\mu} \big( \log p(\mu) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \big)
\end{equation}
(For the MLE, the positive factor $\frac{1}{2\sigma^2}$ can be dropped because
it does not change the maximizer; for the MAP it must be kept, since it weights
the likelihood against the prior.)
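Before turning to the provided code, note that both estimates can be
approximated by simply evaluating the log-likelihood and the log-posterior on a
grid of candidate values for $\mu$. The sketch below is only illustrative and
independent of `estimate\_mu.R' (the variable names are ours); it uses parameter
values similar to exercise 3, with a smaller sample:
\begin{lstlisting}
set.seed(1)
sigma   = 10                        # known standard deviation
data    = rnorm(20, mean = 3, sd = sigma)
mu.grid = seq(-10, 10, by = 0.01)   # candidate values for mu

# log-likelihood of the i.i.d. sample for each candidate mu
loglik   = sapply(mu.grid, function(mu)
              sum(dnorm(data, mean = mu, sd = sigma, log = TRUE)))
# log prior: mu ~ N(2.5, 1)
logprior = dnorm(mu.grid, mean = 2.5, sd = 1, log = TRUE)

mu.grid[which.max(loglik)]             # approximate MLE of mu
mu.grid[which.max(loglik + logprior)]  # approximate MAP of mu
\end{lstlisting}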

\begin{framed}
Exercise 3: The file `estimate\_mu.R' provides you with a
visualization tool for the estimation problem (with both MLE and MAP):
each time you press the Enter key, the program will draw an example 
from the gold model
and use it to update $\hat{\mu}_{MLE}$ and $\hat{\mu}_{MAP}$; 
after that, it will show a plot containing graphs of log-likelihood
and log posterior probability over $\mu$ 
and another plot containing 
graphs of $\hat{\mu}_{MLE}$ and $\hat{\mu}_{MAP}$ over sample size.

In this exercise, we assume that the prior distribution is also a normal
distribution, $p(\mu) = N(\mu;\mu_{\mu},\sigma_{\mu}^2)$.

\begin{enumerate}
	\item First of all, you need to set values for parameters and 
	draw a sample
	\begin{lstlisting}
	mu.gold = 3				# mean of the gold model
	sigma.gold = 10		# standard deviation of the gold model
	n = 100					# sample size
	data = rnorm(n, mean = mu.gold, sd = sigma.gold)

	mean.mu = 2.5			# mean of mu (prior)
	sd.mu = 1				# standard deviation of mu (prior)
	\end{lstlisting}
	
    \item Before executing the file, try to predict what the graph
     of log-likelihood over $\mu$ looks like, and
     what the graph of log-posterior-probability over $\mu$ looks like when
     (i) observed data are ignored and (ii) observed data are used.
    
    \item Load the file (\texttt{source("estimate\_mu.R")}), and then
    execute \\
    \texttt{estimate.mu(data, sigma.gold, mean.mu, sd.mu, plot=T)}\\
    (note: the black lines correspond to MLE, the blue lines to MAP).
    Report what you get.

    \item It can be shown that $\hat{\mu}_{MLE} = \frac{1}{n} \sum_{i=1}^n x_i$.
    Confirm this by computing the sample average \texttt{sum(data)/n}.
    (Note: $\hat{\mu}_{MLE}$ computed by the program is rounded.)
    
    \item Change $p(\mu)$ by setting \texttt{mean.mu = -2, sd.mu = 1},
    then execute \texttt{estimate.mu(...)} again.
    Now set \texttt{mean.mu = -2, sd.mu = 1000} and execute
    \texttt{estimate.mu(...)} once more.
    What do you conclude about the effect of the prior distribution?
\end{enumerate}

\end{framed}

\begin{framed}
Exercise 4: In this exercise, we will compare MLE to MAP by computing
the mean squared error as a function of sample size.

\begin{enumerate}
	\item First, we set up the experiment as in exercise 1
	\begin{lstlisting}
	n = 100
	mu.gold = 3
	sigma.gold = 10
	mean.mu = 2.5
	sd.mu = 1
	\end{lstlisting}
	
	Then, we compute the mean squared errors over $m$ runs
	\begin{lstlisting}
	mse.mle = rep(0,n); mse.map = rep(0,n)
	m = 100
	
	for (i in 1:m) {
	   data = rnorm(n, mean = mu.gold, sd = sigma.gold)
	   mu.est = estimate.mu(data, sigma.gold, mean.mu, sd.mu, plot=F)
	   mse.mle = mse.mle + (mu.est$mu.mle - mu.gold)^2
	   mse.map = mse.map + (mu.est$mu.map - mu.gold)^2
	}

	mse.mle = mse.mle / m	# mean squared error over the m runs
	mse.map = mse.map / m
	\end{lstlisting}
	
	And finally plot the errors
	\begin{lstlisting}
	plot(1:n, mse.mle, type='l', ylim=range(mse.mle, mse.map), xlab='sample size', ylab='MSE')
	lines(1:n, mse.map, col='blue')	
	\end{lstlisting}
	(Don't forget our notation: black is MLE, blue is MAP.)
	
	\item Set \texttt{n = 3000} and rerun the above. 
	
	\item Based on what you have done so far, draw conclusions about 
	MLE vs MAP and when MAP is useful. 
	
\end{enumerate}


\end{framed}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\section{Statistical Inference with Sampling methods}

%According to Wikipedia, Statistical Inference 
%\begin{quote}
%is the process of drawing conclusions from data that are subject to random variation, 
%for example, observational errors or sampling variation.
%\end{quote}
%So what we do in Section~\ref{section parameter estimation} is statistical inference:
%we draw 





%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Submission}
You have to submit a file named `your\_name.pdf'.
The deadline is 15:00 Monday 16 Dec.
If you have any questions, contact Phong Le (p.le@uva.nl).

\end{document}

